Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Sequencing and Raw Sequence Data Quality Control ◾ 3

1.2 SEQUENCING

DNA/RNA sequencing is the determination of the order of the four nucleotides in a nucleic acid molecule. The recovered order of the nucleotides in the genome of an organism is called a sequence. Sequencing DNA helps scientists investigate the functions of genes, the roles of mutations in traits and diseases, species and the evolutionary relationships between them, the diagnosis of diseases caused by genetic factors, the development of gene therapy, criminal investigations and legal cases, and more. Since the nucleotides are distinguished by their bases, DNA and RNA sequences are represented in bioinformatics as strings of single-character nucleobase symbols (A, C, G, and T for DNA and A, C, G, and U for RNA).
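
The single-character representation described above maps directly onto ordinary strings. The following is a minimal sketch, not taken from the book: the names DNA_ALPHABET, RNA_ALPHABET, is_valid, and dna_to_rna are hypothetical and chosen only for illustration. It checks that a string uses only the four expected symbols and rewrites a DNA string in the RNA alphabet by replacing T with U.

```python
# Illustrative helpers (hypothetical names): sequences are plain strings
# over a four-letter alphabet.
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")


def is_valid(seq, alphabet):
    """Return True if every symbol in seq belongs to the given alphabet."""
    return set(seq.upper()) <= alphabet


def dna_to_rna(dna):
    """Rewrite a DNA string in the RNA alphabet by replacing T with U."""
    if not is_valid(dna, DNA_ALPHABET):
        raise ValueError("sequence contains symbols other than A, C, G, T")
    return dna.upper().replace("T", "U")


print(dna_to_rna("ATGCTTAG"))  # prints AUGCUUAG
```

Real sequence files often contain additional symbols, such as N for an unknown base, which this sketch would reject.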

Attempts to sequence nucleic acids began soon after the landmark discovery in 1953 of the double-helix structure of DNA by James Watson and Francis Crick. Alanine tRNA was the first nucleic acid to be sequenced, in 1965, by the Nobel Prize winner Robert Holley. Holley used two ribonuclease enzymes to split the tRNA at specific nucleotide positions, and the order of the nucleotides was determined manually [1]. The first gene was sequenced in 1972 by Walter Fiers. It was the gene that codes for the coat protein of the bacteriophage MS2, an RNA virus, and it was sequenced by using enzymes to break the bacteriophage RNA into pieces and separating the fragments with electrophoresis and chromatography [2]. The sequencing of the alanine tRNA by Robert Holley and the sequencing of the gene for the bacteriophage MS2 coat protein are among the major milestones in the history of genomics and DNA sequencing. They paved the way for first-generation sequencing.

1.2.1 First-Generation Sequencing

The 1970s witnessed the emergence of first-generation sequencing, when the American biologists Allan M. Maxam and Walter Gilbert developed a chemical method for sequencing, followed by the English biochemist Frederick Sanger, who developed the chain-terminator method. The Sanger method became the more widely used of the two and remains in use to this date. Both methods were used in shotgun sequencing, which involves breaking the genome into DNA fragments and sequencing the fragments individually. The genome sequence is then assembled from the overlaps found by aligning the fragment sequences.
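
The idea of assembling a sequence from overlapping fragments can be sketched in a few lines of code. The example below is a toy illustration under strong assumptions, not the procedure used by real assemblers: the fragments are error-free, all come from the same strand, and no fragment is fully contained in another. The function names overlap and greedy_assemble and the min_len parameter are hypothetical. The sketch repeatedly merges the pair of fragments with the longest suffix-to-prefix overlap.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len symbols exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Greedily merge the best-overlapping pair until no overlaps remain."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge; remaining contigs stay separate
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Three made-up fragments of the made-up sequence ATGGCGTACCTTAGA
reads = ["CCTTAGA", "ATGGCGTA", "GCGTACCTT"]
print(greedy_assemble(reads))  # prints ['ATGGCGTACCTTAGA']
```

A greedy merge like this can go wrong on repeated regions, which is one reason practical assembly relies on more careful alignment and graph-based methods.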

The Maxam–Gilbert sequencing method is based on the chemical modification of DNA molecules and subsequent cleavage at specific bases. First, the DNA is denatured (the two strands are separated) into single-stranded DNA (ssDNA) molecules by heating or by a helicase enzyme. The ssDNA is run on an electrophoresis gel to separate the two strands into two bands. Either band (strand) can be cut from the gel and sequenced. In the sequencing step, the ssDNA solution is divided into four reaction tubes labeled A+G, G, C+T, and C. The ssDNA in each tube is labeled chemically with an isotope and treated with a specific chemical that breaks the strand at the nucleotides indicated by the tube label. After the reaction, a polyacrylamide gel is used to run the four reactions in four separate lanes (A+G, G,